In this paper, we propose a CNN-based framework for online multi-object tracking (MOT). The framework exploits the strengths of single object trackers in adapting appearance models and searching for the target in the next frame. Naively applying a single object tracker to MOT runs into two problems: poor computational efficiency and drift caused by occlusion. Our framework achieves computational efficiency by sharing features across targets and using ROI-Pooling to obtain individual features for each target. Online-learned, target-specific CNN layers adapt the appearance model of each target. Within the framework, we introduce a spatial-temporal attention mechanism (STAM) to handle the drift caused by occlusion and by interaction among targets. A visibility map of the target is learned and used to infer a spatial attention map, which is then applied to weight the features. In addition, the occlusion status can be estimated from the visibility map; it controls the online updating process through a weighted loss on training samples with different occlusion statuses in different frames, which can be regarded as a temporal attention mechanism. The proposed algorithm achieves 34.3% and 46.0% MOTA on the challenging MOT15 and MOT16 benchmark datasets, respectively.
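The two attention steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the normalisation of the visibility map into an attention map, the use of mean visibility as the occlusion status, and the form of the weighted loss are all assumptions made for illustration.

```python
import numpy as np

def spatial_attention(visibility):
    """Turn a visibility map (H, W) into an attention map that sums to 1.

    Assumption: attention is simply the clipped, normalised visibility map.
    """
    v = np.clip(np.asarray(visibility, dtype=float), 0.0, None)
    total = v.sum()
    if total == 0.0:
        # Fully occluded: fall back to uniform attention.
        return np.full_like(v, 1.0 / v.size)
    return v / total

def apply_attention(features, attention):
    """Weight a feature tensor (C, H, W) by an attention map (H, W)."""
    return features * attention[None, :, :]

def occlusion_status(visibility):
    """Assumption: occlusion status is the mean visibility in [0, 1]."""
    return float(np.mean(visibility))

def weighted_loss(per_sample_losses, statuses):
    """Temporal attention sketch: samples from less occluded frames
    contribute more to the online update (hypothetical weighting)."""
    losses = np.asarray(per_sample_losses, dtype=float)
    weights = np.asarray(statuses, dtype=float)
    return float((weights * losses).sum() / max(weights.sum(), 1e-12))
```

In this sketch a fully occluded sample (status 0) has no influence on the update, which mirrors the idea of down-weighting training samples from occluded frames.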